Method and apparatus for providing continuous availability of applications in a computer network.



The method and apparatus for maintaining active
sessions between communicating logical units (10,
40) in a computer network when an application system
fails, without having to re-establish the active
sessions, are carried out by activating a persistent
session capability at one of the logical units. Thus,
the active sessions can be suspended and maintained
while attempts at recovery are made. Recovery
attempts include restarting the failed application
or switching the suspended sessions to an alternate
instance of the logical unit (10). The suspended
sessions are re-synchronized with the application
system and session activity is resumed.
METHOD AND APPARATUS FOR PROVIDING CONTINUOUS AVAILABILITY OF APPLICATIONS IN A COMPUTER NETWORK



The invention relates generally to computer 
networks and more particularly to a method for 
recovery from application system failure in order to 
provide continuous availability of the application 
system. 

Prior art computer networks are controlled by a
system architecture which ensures the orderly flow
of information throughout the system. Systems Network
Architecture (SNA) is a system architecture
developed by IBM Corporation which controls the 
configuration and operation of a computer commu- 
nications network. It provides the description of the 
logical structure, formats, protocols, and operational 
sequences for transmitting information units 
through the network. 

The network is composed of nodes interconnected
by communications facilities. The nodes
may be of widely varying functional capability,
ranging from terminals with minimal native processing
capability to complex multiprocessors. The
communication facilities also come in a number of
varieties, ranging from high speed I/O channels to
low speed, point-to-point telephone lines and including
such media as satellite links and wide-band
optical fibers.

Each node is comprised of a physical unit (PU)
which controls the physical resources of the node
(e.g., links) and one or more logical units (LU)
which are used to partition, allocate, and control the 
devices associated with end-user communications. 
The Virtual Telecommunication Access Method 
(VTAM) is a telecommunications access method 
software program, developed by IBM Corporation, 
which is resident in a host processor and other 
resources in the computer network. A VTAM ap- 
plication program is a program that uses VTAM 
macro instructions to communicate with terminals. 
VTAM allows a plurality of application programs to 
be used at a single terminal. An application pro- 
gram within a host processor can be used at any 
location in the network without the program having 
any awareness of network organization. 

Users in the network communicate by estab- 
lishing a session between the logical units (LU) that 
represent them. A session involves a definition of 
the characteristics of the communication between 
two end-users. Each logical unit couples a user to 
the SNA network. Two logical units can have mul- 
tiple logical connections or parallel sessions estab- 
lished between them. 

Currently, when a network application fails, all
of the sessions of the application are terminated
(unbound). Application recovery requires the sessions
to be re-established. This process is slow,
thereby causing application recovery to take an
unacceptably long time, especially if there was a
large number of sessions.

Any fault tolerant solution requires two basic
ingredients: redundancy and state recording.
Redundancy may come in the form of duplicate hardware
and software, along with the appropriate access
paths (e.g., busses, links, cache, etc.). State
recording takes place during normal processing, such
that when a fault occurs and recovery is invoked, a
consistent "next" state can be constructed in order
that the process can continue properly.

One solution to this problem has been to add
additional hardware and software system elements
to create an alternate application subsystem which
is kept synchronized with the active subsystem.
For example, an alternate processor, with the same
type of application program, can establish back-up
sessions for any of the sessions that the primary
host processor currently has active. If the primary
processor was unable to perform its function for
any reason, such as hardware, operating system,
VTAM or application failure, the alternate processor
could be used immediately to service the users
that had active sessions with the primary processor.
A major drawback to this approach is that it
requires purchase of redundant processor hardware
and software. Moreover, a separate back-up session
is required for each active session.

It is therefore an object of this invention to
provide a method that addresses failure of a network
application system by switching from the failing
program to an alternate system or restarting the
failed system, without having to re-establish all
sessions.

It is a further object of this invention to provide 
a method for recovery from an application system 
failure that does not require that back-up sessions 
be established. 

The objectives of the present invention are
achieved, in the event of an application failure, by
the resident communications access method, e.g.,
VTAM, suspending the active sessions, maintaining
session state information in memory outside the
address space of the affected logical unit, and
resuming the suspended sessions. This invention
allows a failing application to recover either by
restarting or by transferring control to an alternate
copy. This method is embodied in the concept of a
"persistent session" between logical units.

The invention will be better understood from the
following description made in reference to the
accompanying drawings, wherein:

Fig. 1A shows a logical unit to logical unit
connection before a logical unit fails.

Fig. 1B shows an alternate logical unit to
logical unit connection after recovery takes place.

Fig. 2 is a graph of the persistent session
finite state machine (FSM).

Fig. 3 is a table showing state transitions in a
persistent session FSM.

Several recovery alternatives are included in
the present invention. Recovery can be achieved
locally on the same operating system image. One
alternative is to restart the application on the same
operating system after abnormal termination processing
is complete. Another alternative is to start
a copy of the application that runs under the same
operating system as before the failure but in a
different address space (local alternate). The
application copy can initiate recovery processing when it
is signalled that the active application has failed or
is in the process of failing. Remote recovery is
required for catastrophic failure resulting from faults
in the operating system, VTAM or the hardware;
however, it can be used in any case for which local
recovery is not desired (e.g., operating ...). The
application can be restarted on the remote system
after the failure. Likewise, recovery can be
accomplished by transferring control to a previously
started copy of the application on the remote system.

Some subsystem applications are not struc- 
tured in a manner that allows local alternates for 
recovery. Thus, it is essential to restart the failed 
subsystem as fast as possible in order to minimize 
the outage to the end user. Persistent sessions
enable sessions to be preserved across an application
failure and during the subsequent restart.
Reacquiring sessions during restart is extremely
fast because costly session establishment is eliminated
and no external flows are generated.

For applications that do permit local alternates
for recovery, the local alternate is initialized and
waiting to initiate recovery should the primary
application fail. One key property of the persistent
session is that the local alternate maintains the
same name as the active logical unit. This enables
any existing dependent tasks to easily re-establish
linkages with the recovery logical unit and
re-synchronize without the need for monitoring additional
sessions or requiring human intervention. Persistent
sessions provide recoverability for all sessions,
regardless of their logical unit types. Furthermore,
both primary and secondary logical units can be
recovered regardless of the manner in which the
device containing the partner logical unit is
attached to the network.

The persistent session capability is enabled as
a VTAM feature by additional parameters on the
statements used to define the application as a
logical unit in the computer network. If the persistent
session capability is used by an application
logical unit, then all of its sessions inherit the
"persistent" attribute.

When an application logical unit fails or an
alternate logical unit initiates a take-over, the active
sessions of the logical unit are disconnected from
any application logical unit for some period of time.
VTAM assumes responsibility for the active sessions
until a logical unit indicates that it is prepared
to assume responsibility for the sessions. During
the "outage", the sessions are suspended. Suspended
sessions are in a recovery-pending state.
An important observation is that activity may still
occur on a suspended session, e.g., data may still
be in transit. A suspended session is handled in
such a way that the correct session resources
remain allocated and that the session states are
tracked for future recovery actions
(re-synchronization).

Session tracking begins at the time the session
is created and continues until the session is
terminated, continuing even when a session is suspended.
During session tracking, VTAM saves the
session state information for each data request unit
(RU) sent or received on the session. This is
necessary because the current session state must be
passed to the recovery logical unit when it assumes
responsibility for the session. In addition, a
correct session protocol must be preserved in order
to prevent session termination due to a protocol
error.
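
The tracking described above can be pictured with a small sketch. It is only an illustration under stated assumptions: the names `SessionState` and `SessionTracker` and the use of sequence numbers as the recorded state are hypothetical and are not taken from VTAM. The point shown is that the record is updated for every RU, including while the session is suspended, and that a copy can be handed to the recovery logical unit.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass
class SessionState:
    """Hypothetical per-session record; the field names are illustrative."""
    session_id: str
    suspended: bool = False
    last_sent_seq: int = 0      # sequence number of the last RU sent
    last_received_seq: int = 0  # sequence number of the last RU received


class SessionTracker:
    """Kept by the access method, outside the application's address space."""

    def __init__(self) -> None:
        self._sessions: Dict[str, SessionState] = {}

    def create(self, session_id: str) -> None:
        # Tracking begins when the session is created.
        self._sessions[session_id] = SessionState(session_id)

    def record_ru(self, session_id: str, direction: str, seq: int) -> None:
        # Called for every RU, whether or not the session is suspended.
        state = self._sessions[session_id]
        if direction == "sent":
            state.last_sent_seq = seq
        elif direction == "received":
            state.last_received_seq = seq
        else:
            raise ValueError(f"unknown direction: {direction!r}")

    def suspend_all(self) -> None:
        for state in self._sessions.values():
            state.suspended = True

    def snapshot(self, session_id: str) -> SessionState:
        # Copy handed to the recovery logical unit when it takes over.
        return replace(self._sessions[session_id])

    def terminate(self, session_id: str) -> None:
        # Tracking ends when the session is terminated.
        del self._sessions[session_id]
```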

If an application fails and subsequently attempts
to recover via restart or a local alternate
and the recovery processing fails, the application
can deactivate the persistent session capability
during the recovery interval so that normal termination
processing can be executed. Subsequently, after
application recovery is complete, the persistent
session capability is again activated to guard
against subsequent application failures. To ensure
that sessions are not suspended indefinitely due to
failure and recovery procedures, a safety timer is
required so that eventually, session clean-up can
be executed.

VTAM provides three additional notifications to
a network session monitor to reflect the recovery
status of persistent sessions:

1. Recovery pending - this indicates that
sessions for a given application logical unit have
been suspended but no alternate logical unit has
been initiated to resume session activity;

2. Recovery in progress - this indicates that
an alternate logical unit is active, but not all
sessions have been processed;

3. Recovery complete - this indicates that all
sessions have been recovered.

The session monitor is a function of a network
management program, as exemplified by the IBM
software product NetView™.

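
As a minimal sketch of how the three notifications relate to each other, the function below derives the status from two inputs. The enum `RecoveryStatus`, the function `recovery_status` and its parameters are hypothetical names chosen for illustration; the text does not specify how VTAM computes or delivers these notifications.

```python
from enum import Enum


class RecoveryStatus(Enum):
    RECOVERY_PENDING = "recovery pending"
    RECOVERY_IN_PROGRESS = "recovery in progress"
    RECOVERY_COMPLETE = "recovery complete"


def recovery_status(alternate_active: bool, unprocessed_sessions: int) -> RecoveryStatus:
    """Map the situation of a logical unit to one of the three notifications."""
    if not alternate_active:
        # Sessions are suspended and no alternate logical unit has started.
        return RecoveryStatus.RECOVERY_PENDING
    if unprocessed_sessions > 0:
        # An alternate is active but some sessions still await a decision.
        return RecoveryStatus.RECOVERY_IN_PROGRESS
    # Every suspended session has been resumed or terminated.
    return RecoveryStatus.RECOVERY_COMPLETE
```
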
In the basic embodiment of the persistent session
capability, although session states are tracked
and preserved, the actual application (RU) data can
be discarded. Discarding the data will, in many
cases, lead to a lack of failure transparency to the
end user. In an alternate embodiment, instead of
discarding application data during the outage period,
the data can be queued for subsequent delivery
to the recovered application.
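
The difference between the two embodiments can be sketched as a single flag. The class `SuspendedSession`, its `queue_data` flag and the method names are assumptions made for this illustration only; in both modes the session state (here reduced to a simple RU count) continues to be tracked, and only the handling of the payload differs.

```python
from collections import deque
from typing import Deque, List


class SuspendedSession:
    """Hypothetical holder for RU data arriving during the outage period."""

    def __init__(self, queue_data: bool) -> None:
        self.queue_data = queue_data      # False: basic embodiment (discard)
        self.rus_seen = 0                 # state is tracked in both embodiments
        self._pending: Deque[bytes] = deque()

    def on_ru(self, data: bytes) -> None:
        self.rus_seen += 1
        if self.queue_data:
            # Alternate embodiment: keep the payload for later delivery.
            self._pending.append(data)
        # Basic embodiment: the payload is dropped; only the state survives.

    def drain(self) -> List[bytes]:
        # Deliver queued data, in arrival order, to the recovered application.
        items = list(self._pending)
        self._pending.clear()
        return items
```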

Referring to Figs. 1A and 1B, there is shown a
logical representation of a computer network. A
session is depicted from LUx in address space 10
to LUy in address space 40. The labeled boxes
represent address spaces in which the application
LUs execute. In the communication services
component 30, a session control block (SCB) 32 and
memory 34 for recording the session state information
are maintained. Address space 20 depicts a
recovery (alternate) instance for LUx. The connection
labeled 22 in Fig. 1A depicts the session
appearance in address space 10 when processing
is proceeding normally. When LUx fails in address
space 10, connection 22 is broken. Connection 24,
as shown in Fig. 1A, does not exist as long as the
active LUx in address space 10 executes normally.
If the active LUx fails, however, and another
instance of LUx is started or triggered to perform
application recovery, then the recovering LUx
instance in address space 20 invokes the communication
services 30 to resume the session between
LUx and LUy, resulting in connection 24 as
shown in Fig. 1B.

Depending on the type of recovery, address
space 20 can be viewed as a restarted version of
LUx in address space 10, as another address
space that contains a local alternate, or as an
alternate in another host system. In the latter case,
the connection between the communication services
30 and the address space that contained the
active logical unit requires a communication access
via a channel, bus, or high speed link. When LUx in
address space 10 fails, communications services
30 maintains the session resource (SCB) 32 and
continues to keep the session status current. Thus,
during the entire outage period, the state of the
session from the viewpoint of the network is correctly
maintained. Since the application LU in address
space 10 has failed, information relating activity
on the session with the application state may
have been lost. During recovery, the application
issues communications services 30 commands to
retrieve the session states, resume ownership of
the session, and resynchronize the session with the
application state.

VTAM will track and retain information on each
LU-LU session established by an application after
the application opens an access method control
block (ACB) for which persistent sessions have
been specified. If persistence has not been started
by an application and the ACB is closed, all LU-LU
sessions with that application will be terminated in
the same way as currently. However, if persistence
has been started and the ACB is closed, these
sessions will not be terminated. Instead, VTAM
suspends the sessions and waits for the application
to recover. It also notifies the session monitor that
recovery is pending and activates a safety timer if
one has been specified by the application. If the
application cannot recover (i.e., reopen its ACB)
before the timer expires, all suspended sessions
are terminated. LU-LU data received by VTAM
while a session is suspended will be queued.

After the application has recovered and reopened
its ACB, VTAM indicates that the application
is persistent and informs the session monitor
that recovery is in progress. The application
requests information about the suspended sessions
from VTAM and specifies on a per-session basis
whether or not each session is to continue. If the
indication is to continue, the session reverts to the
active state. If the indication is not to continue, the
session is terminated. After the last suspended
session becomes active or is terminated, the session
monitor is notified that a recovery is complete.
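
The per-session decision made after the ACB is reopened can be sketched as a simple loop. The function `recover_sessions` and its callback parameters are hypothetical; they stand in for whatever interface the application and the session monitor actually use, which the text does not specify.

```python
from typing import Callable, Dict, List


def recover_sessions(
    suspended: List[str],
    should_continue: Callable[[str], bool],
    notify_monitor: Callable[[str], None],
) -> Dict[str, str]:
    """Resume or terminate each suspended session, reporting progress."""
    notify_monitor("recovery in progress")
    outcome: Dict[str, str] = {}
    for session_id in suspended:
        # The application decides on a per-session basis.
        if should_continue(session_id):
            outcome[session_id] = "active"
        else:
            outcome[session_id] = "terminated"
    # Sent only after the last suspended session has been dealt with.
    notify_monitor("recovery complete")
    return outcome
```

For example, an application might continue every session except those whose partner is known to be gone; the predicate passed as `should_continue` carries that policy.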

Fig. 2 describes the persistent session capability
in terms of a finite state machine (FSM) graphic
representation. Fig. 3 describes persistent session
capability in terms of an FSM table representation
wherein the numbers in the table represent the
next state to which the current state transitions.
The persistent session FSM consists of six states
and six signals. The six states are as follows:
RESET; LOGICAL UNIT ACTIVE/PERSISTENT
SESSIONS INACTIVE; LOGICAL UNIT
ACTIVE/PERSISTENT SESSIONS ACTIVE;
RECOVERY PENDING; RECOVERY IN
PROGRESS/PERSISTENT SESSIONS ACTIVE;
RECOVERY IN PROGRESS/PERSISTENT
SESSIONS INACTIVE. The six signals are OPEN,
CLOSE, activate persistent sessions (PERSIST),
deactivate persistent sessions (NPERSIST), recovery
complete (REC COMP), and time out
(TIMEOUT). The OPEN signal connects the application
to the communications services component;
the CLOSE signal disconnects the application.
The six states are represented by blocks in
Fig. 2, with the top part of each block containing a
state number from 1 to 6, and the lower part of
each block containing a state description. In Fig. 3,
the current state is labeled at the top of each
column by a state number and state description.

There are two categories of processing associated
with a finite state machine, i.e., processing
associated with a given state and processing
associated with a transition from one state to another.
Referring now to Fig. 2 and the persistent session
FSM table in Fig. 3, the basic states and transitions
will be explained. The state transition processing
for a given state transition is labeled by the signal
above each directed arrow.

State 1, the RESET state, represented by block 100,
is the state before the persistent session logical
unit is active. The OPEN signal results in initial
processing of an access method control block and the
setting up of VTAM for session tracking. The only
transition from RESET is to state 2, block 102, which
represents the LOGICAL UNIT ACTIVE/PERSISTENT
SESSIONS INACTIVE state. As the state name implies,
the logical unit is active but the persistent session
capability is still inactive. The application may
create, use and terminate sessions. All active
sessions are tracked while in this state but the
logical unit will terminate if a CLOSE signal is
issued. The termination close processing results in
the unbinding of all sessions for this logical unit
and the cleaning up of resources for this logical
unit. A CLOSE signal returns the FSM representation to
the RESET state, block 100. From block 102 (state 2),
a PERSIST signal will cause a transition to block 104
(state 3), which represents the LOGICAL UNIT
ACTIVE/PERSISTENT SESSIONS ACTIVE state and enables
subsequent OPEN signals in case a take-over is
required. An OPEN signal issued from state 2 will
result in an error signal (Fig. 3) that indicates
an access method control block (ACB) is already
open and will cause a return to the same state.

In state 3, LOGICAL UNIT ACTIVE/PERSISTENT SESSIONS
ACTIVE, block 104, the application may create, use and
terminate sessions. In this state all active sessions
are tracked. Since the persistent sessions capability
is enabled, if the application is terminated or
another OPEN signal is issued for this logical unit,
all active sessions are preserved. Three signals
can be issued when in this state, namely, OPEN,
CLOSE, and deactivate persistent sessions
(NPERSIST). The NPERSIST signal disables the
persistent session function, thereby causing a
transition to block 102 (state 2), LOGICAL UNIT
ACTIVE/PERSISTENT SESSIONS INACTIVE.

If an OPEN signal is issued in state 3, there is
a transition to state 5, RECOVERY IN
PROGRESS/PERSISTENT SESSIONS ACTIVE,
which is represented by block 108. This signal
causes the switching of sessions to the take-over
task. The actual sequence of operations is to suspend
all sessions of the logical unit; to disconnect
logical unit resources from the current task associated
with the access method control block (ACB);
to connect logical unit resources to the ACB of the
take-over task; to notify the session monitor that
recovery is in progress; and to return an indication
that the logical unit is persistent to the take-over
task.
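
The take-over sequence listed above can be sketched step by step. The `LogicalUnit` record, the string ACB identifiers and the `notify_monitor` callback are all hypothetical stand-ins introduced for this illustration; the code only mirrors the order of the five operations in the text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class LogicalUnit:
    """Hypothetical view of a logical unit held by the access method."""
    name: str
    owner_acb: Optional[str] = None                 # ACB of the owning task
    sessions: Dict[str, str] = field(default_factory=dict)  # id -> status


def take_over(lu: LogicalUnit, takeover_acb: str,
              notify_monitor: Callable[[str], None]) -> bool:
    # 1. Suspend all sessions of the logical unit.
    for session_id in lu.sessions:
        lu.sessions[session_id] = "suspended"
    # 2. Disconnect logical unit resources from the current task's ACB.
    lu.owner_acb = None
    # 3. Connect logical unit resources to the ACB of the take-over task.
    lu.owner_acb = takeover_acb
    # 4. Notify the session monitor that recovery is in progress.
    notify_monitor("recovery in progress")
    # 5. Return an indication that the logical unit is persistent.
    return True
```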

If a CLOSE signal is issued in state 3, there is
a transition to state 4, RECOVERY PENDING,
represented by block 106. Transitioning to this state
results in close processing for suspension of sessions.
This causes suspension of all sessions of
the logical unit, the discard of queued session
requests, the discard of queued request units (RU)
if data queueing is not in effect, the switch of
logical unit resources from the closing access
method control block (ACB) to VTAM, the notification
of RECOVERY PENDING state to the session
monitor, and the start of the safety timer.

The RECOVERY PENDING state (state 4),
represented by block 106, is entered when another
recovery (alternate) instance of the logical unit
does not exist. While in this state, VTAM will
"track" all active sessions as well as handle
communications events that relate to these sessions
as, for example, the receipt of an unbind request.
An OPEN signal issued when in state 4 will initiate
take-over of the suspended sessions and cause a
transition to state 5, RECOVERY IN
PROGRESS/PERSISTENT SESSIONS ACTIVE,
block 108. The specific steps followed are to connect
logical unit resources to the ACB of a take-over
task, to notify the session monitor that recovery
is in progress, to return an indication that the
logical unit is persistent to the take-over task and to
reset the safety timer. The TIMEOUT signal is
generated when the safety timer expires. A
TIMEOUT signal forces the logical unit from state 4
back to state 1, the RESET state, represented by block
100. Therefore, following a TIMEOUT, all sessions
are terminated, and any subsequent OPEN signal
causes a fresh instance of the logical unit to be
established.
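
The role of the safety timer can be sketched with a deadline-based helper. The class `SafetyTimer` and its methods are hypothetical; time is passed in explicitly (seconds on any monotonic clock) so that the behaviour is easy to follow, and nothing here reflects how VTAM actually implements its timer.

```python
from typing import Optional


class SafetyTimer:
    """Hypothetical deadline timer guarding the RECOVERY PENDING state."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._deadline: Optional[float] = None

    def start(self, now: float) -> None:
        # Started when the logical unit enters RECOVERY PENDING.
        self._deadline = now + self.interval

    def reset(self) -> None:
        # Reset when a take-over task issues OPEN before the deadline.
        self._deadline = None

    def expired(self, now: float) -> bool:
        # True corresponds to the TIMEOUT signal being generated.
        return self._deadline is not None and now >= self._deadline
```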

When in state 5, RECOVERY IN
PROGRESS/PERSISTENT SESSIONS ACTIVE,
represented by block 108, the logical unit has been
activated and the persistent session capability is
active. The application may create, use, and terminate
sessions. In addition, the application is expected
to take actions on all sessions that are
suspended on its behalf. The possible actions are
resuming session activity or terminating the session.
The logical unit is considered in the
"recovery-in-progress" state until the last session
is recovered. A recovery complete (REC COMP)
signal changes the state to state 3, LOGICAL UNIT
ACTIVE/PERSISTENT SESSIONS ACTIVE, block
104.

If a failure occurs while in state 5, all the
sessions of the logical unit will still be preserved
and will again be marked suspended. This is indicated
by the OPEN signal which keeps the logical
unit in state 5. A CLOSE signal issued while in
state 5 will cause a transition to state 4, RECOVERY
PENDING, block 106, performing close processing
for the suspension of sessions in the same
way as was done by the CLOSE signal which
causes a transition from state 3 to state 4. The
persistent session capability can be deactivated
while in state 5 and will cause a transition to state
6, RECOVERY IN PROGRESS/PERSISTENT SESSIONS
INACTIVE, represented by block 110. While
in state 6, the logical unit is taking recovery actions
to resume its suspended sessions as was done in
state 5. However, if the logical unit fails while in
this state, normal termination processing is executed
and all sessions are unbound. Besides a
CLOSE signal, a PERSIST signal or a REC COMP
signal can also be issued while in state 6. A PERSIST
signal causes a transition back to state 5. A
REC COMP causes a transition back to state 2,
LOGICAL UNIT ACTIVE/PERSISTENT SESSIONS
INACTIVE. The REC COMP signal indicates that all
session recovery actions for the logical unit have
been completed, which means that the recovery
logical unit instance has taken the appropriate actions
on all sessions that have been suspended on
behalf of the logical unit. An attempt to issue an
OPEN signal while in state 6 will result in an error
message (Fig. 3) and the logical unit will remain in
that state.
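
The transitions described in the preceding paragraphs can be collected into a small table-driven sketch. Fig. 3 itself is not reproduced in the text, so the table below contains only the transitions stated in prose, with one labeled assumption: a CLOSE in state 6 is taken to lead to RESET because the text says normal termination unbinds all sessions. The names `State`, `Signal`, `TRANSITIONS`, `ERROR_PAIRS` and `step` are illustrative.

```python
from enum import Enum
from typing import Dict, Set, Tuple


class State(Enum):
    RESET = 1
    LU_ACTIVE_PS_INACTIVE = 2
    LU_ACTIVE_PS_ACTIVE = 3
    RECOVERY_PENDING = 4
    RECOVERY_IN_PROGRESS_PS_ACTIVE = 5
    RECOVERY_IN_PROGRESS_PS_INACTIVE = 6


class Signal(Enum):
    OPEN = "OPEN"
    CLOSE = "CLOSE"
    PERSIST = "PERSIST"
    NPERSIST = "NPERSIST"
    REC_COMP = "REC COMP"
    TIMEOUT = "TIMEOUT"


S, G = State, Signal

# Transitions taken from the prose description only.
TRANSITIONS: Dict[Tuple[State, Signal], State] = {
    (S.RESET, G.OPEN): S.LU_ACTIVE_PS_INACTIVE,
    (S.LU_ACTIVE_PS_INACTIVE, G.CLOSE): S.RESET,
    (S.LU_ACTIVE_PS_INACTIVE, G.PERSIST): S.LU_ACTIVE_PS_ACTIVE,
    (S.LU_ACTIVE_PS_ACTIVE, G.NPERSIST): S.LU_ACTIVE_PS_INACTIVE,
    (S.LU_ACTIVE_PS_ACTIVE, G.OPEN): S.RECOVERY_IN_PROGRESS_PS_ACTIVE,
    (S.LU_ACTIVE_PS_ACTIVE, G.CLOSE): S.RECOVERY_PENDING,
    (S.RECOVERY_PENDING, G.OPEN): S.RECOVERY_IN_PROGRESS_PS_ACTIVE,
    (S.RECOVERY_PENDING, G.TIMEOUT): S.RESET,
    (S.RECOVERY_IN_PROGRESS_PS_ACTIVE, G.OPEN): S.RECOVERY_IN_PROGRESS_PS_ACTIVE,
    (S.RECOVERY_IN_PROGRESS_PS_ACTIVE, G.CLOSE): S.RECOVERY_PENDING,
    (S.RECOVERY_IN_PROGRESS_PS_ACTIVE, G.NPERSIST): S.RECOVERY_IN_PROGRESS_PS_INACTIVE,
    (S.RECOVERY_IN_PROGRESS_PS_ACTIVE, G.REC_COMP): S.LU_ACTIVE_PS_ACTIVE,
    (S.RECOVERY_IN_PROGRESS_PS_INACTIVE, G.PERSIST): S.RECOVERY_IN_PROGRESS_PS_ACTIVE,
    (S.RECOVERY_IN_PROGRESS_PS_INACTIVE, G.REC_COMP): S.LU_ACTIVE_PS_INACTIVE,
    # ASSUMPTION: target state is not named in the text; sessions are unbound.
    (S.RECOVERY_IN_PROGRESS_PS_INACTIVE, G.CLOSE): S.RESET,
}

# OPEN is rejected with an error in states 2 and 6; the state is unchanged.
ERROR_PAIRS: Set[Tuple[State, Signal]] = {
    (S.LU_ACTIVE_PS_INACTIVE, G.OPEN),
    (S.RECOVERY_IN_PROGRESS_PS_INACTIVE, G.OPEN),
}


def step(state: State, signal: Signal) -> Tuple[State, bool]:
    """Return (next_state, error_flag) for one signal."""
    if (state, signal) in ERROR_PAIRS:
        return state, True
    try:
        return TRANSITIONS[(state, signal)], False
    except KeyError:
        raise ValueError(
            f"{signal.name} in {state.name} is not described in the text"
        ) from None
```

Walking the signals OPEN, PERSIST, CLOSE, OPEN, REC COMP from RESET passes through states 2, 3, 4 and 5 and ends in state 3, which is the failure-and-restart path the description emphasizes.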

While the invention has been particularly 
shown and described with reference to the particu- 
lar embodiment thereof, it will be understood by 
those skilled in the art that various changes in form
and details may be made therein without departing 
from the spirit and scope of the invention. 



Claims 

1. A method for providing continuous availability
of applications by preserving application sessions
between pairs of communicating logical units
located at a plurality of nodes in a computer network,
said computer network having a telecommunications
access method program to control
communication between network resources and a
session monitor to interface with an operator; said
method being characterized in that it comprises the
steps of:

activating a persistent sessions capability at one of
the logical units,

suspending the active sessions of the logical unit,

maintaining the status of the suspended sessions
during the recovery phase,

initiating recovery actions to resume the suspended
sessions,

switching the suspended sessions to a take-over
task, and

resuming session activity on at least one of the
suspended sessions.

2. The method as claimed in Claim 1 including
the step of notifying the session monitor that the
application system is in a recovery pending state.

3. The method as claimed in Claim 1 or 2
further including the step of activating a safety
timer when all suspended active sessions have
been placed into a recovery pending state.

4. The method as claimed in Claim 1, 2 or 3 
including the step of terminating at least one of the 
suspended sessions. 

5. The method as claimed in one of the preceding
Claims including the step of notifying the
session monitor that the recovery of the application
sessions is completed.

6. The method as claimed in one of the preceding
Claims wherein the step of switching the
suspended sessions to a take-over task includes:

disconnecting logical unit resources from the current
task,

connecting logical unit resources to the take-over
task,

notifying the session monitor that session recovery
is in progress; and

returning an indication that the logical unit is
persistent to the take-over task.

7. The method as claimed in any one of Claims
3 to 6 further including the step of terminating all
suspended sessions if the safety timer expires before
the application sessions have recovered.

8. The method as claimed in any one of the
preceding Claims wherein the step of maintaining
the suspended sessions during the recovery phase
further includes:

keeping all session resources allocated,

tracking and preserving session states for subsequent
re-synchronization with the application system
during recovery actions, and

handling all session requests that occur during the
outage period.

9. The method as claimed in Claim 8 wherein
the step of maintaining suspended sessions during
the recovery phase further includes discarding actual
application data received on a given session
during the outage period.

10. The method as claimed in Claim 9 wherein
the step of maintaining suspended sessions during
the recovery phase further includes queuing the
application data received on a given session during the
outage period in a data space for subsequent processing.

11. An apparatus for retaining application sessions
between a pair of communicating logical
units located at a plurality of nodes in a computer
network wherein a network application system is
running at one of said communicating logical units,
said apparatus comprising:

means for activating a persistent sessions capability
at the logical unit running the application system,

means for suspending the application sessions
between said logical units,

means for maintaining the status of the suspended
sessions during the recovery period,

means for restarting the application system,

means for resuming ownership of the suspended
application sessions at the affected logical unit, and

means for reporting the status of the suspended
sessions to the application system.